{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "norman-orlando",
   "metadata": {},
   "source": [
    "# Mexican Personal Identifiers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "sealed-purchase",
   "metadata": {},
   "source": [
    "## Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "delayed-sword",
   "metadata": {},
   "source": [
    "The function `clean_mx_curp()` cleans a column containing Mexican personal identifier (CURP) strings, and standardizes them in a given format. The function `validate_mx_curp()` validates either a single CURP strings, a column of CURP strings or a DataFrame of CURP strings, returning `True` if the value is valid, and `False` otherwise."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "gothic-thirty",
   "metadata": {},
   "source": [
    "CURP strings can be converted to the following formats via the `output_format` parameter:\n",
    "\n",
    "* `compact`: only number strings without any seperators or whitespace, like \"BOXW310820HNERXN09\"\n",
    "* `standard`: CURP strings with proper whitespace in the proper places. Note that in the case of CURP, the compact format is the same as the standard one.\n",
    "* `birthdate`: split the date parts from the number and return the birth date, like \"1875-03-16\".\n",
    "* `gender`: get the person's birth gender ('M' or 'F').\n",
    "\n",
    "Invalid parsing is handled with the `errors` parameter:\n",
    "\n",
    "* `coerce` (default): invalid parsing will be set to NaN\n",
    "* `ignore`: invalid parsing will return the input\n",
    "* `raise`: invalid parsing will raise an exception\n",
    "\n",
    "The following sections demonstrate the functionality of `clean_mx_curp()` and `validate_mx_curp()`. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "grateful-resistance",
   "metadata": {},
   "source": [
    "### An example dataset containing CURP strings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "extra-value",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"curp\": [\n",
    "            'BOXW310820HNERXN09',\n",
    "            'BOXW310820HNERXN08',\n",
    "            '7542011030',\n",
    "            '7552A10004',\n",
    "            '8019010008',\n",
    "            \"hello\",\n",
    "            np.nan,\n",
    "            \"NULL\",\n",
    "        ], \n",
    "        \"address\": [\n",
    "            \"123 Pine Ave.\",\n",
    "            \"main st\",\n",
    "            \"1234 west main heights 57033\",\n",
    "            \"apt 1 789 s maple rd manhattan\",\n",
    "            \"robie house, 789 north main street\",\n",
    "            \"1111 S Figueroa St, Los Angeles, CA 90015\",\n",
    "            \"(staples center) 1111 S Figueroa St, Los Angeles\",\n",
    "            \"hello\",\n",
    "        ]\n",
    "    }\n",
    ")\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "marked-width",
   "metadata": {},
   "source": [
    "## 1. Default `clean_mx_curp`\n",
    "\n",
    "By default, `clean_mx_curp` will clean curp strings and output them in the standard format with proper separators."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "trying-wonder",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import clean_mx_curp\n",
    "clean_mx_curp(df, column = \"curp\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "opponent-shooting",
   "metadata": {},
   "source": [
    "## 2. Output formats"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "respected-democracy",
   "metadata": {},
   "source": [
    "This section demonstrates the output parameter."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "olive-fifty",
   "metadata": {},
   "source": [
    "### `standard` (default)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "spare-configuration",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_mx_curp(df, column = \"curp\", output_format=\"standard\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "norwegian-research",
   "metadata": {},
   "source": [
    "### `compact`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "annoying-folder",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_mx_curp(df, column = \"curp\", output_format=\"compact\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "antique-timer",
   "metadata": {},
   "source": [
    "### `birthdate`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "modular-perception",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_mx_curp(df, column = \"curp\", output_format=\"birthdate\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "egyptian-pepper",
   "metadata": {},
   "source": [
    "### `gender`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "norwegian-calculation",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_mx_curp(df, column = \"curp\", output_format=\"gender\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "direct-original",
   "metadata": {},
   "source": [
    "## 3. `inplace` parameter\n",
    "\n",
    "This deletes the given column from the returned DataFrame. \n",
    "A new column containing cleaned CURP strings is added with a title in the format `\"{original title}_clean\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "municipal-answer",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_mx_curp(df, column=\"curp\", inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "thick-analyst",
   "metadata": {},
   "source": [
    "## 4. `errors` parameter"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "willing-bridges",
   "metadata": {},
   "source": [
    "### `coerce` (default)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "distant-resistance",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_mx_curp(df, \"curp\", errors=\"coerce\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "legitimate-ebony",
   "metadata": {},
   "source": [
    "### `ignore`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "driving-bradford",
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_mx_curp(df, \"curp\", errors=\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "sustained-likelihood",
   "metadata": {},
   "source": [
    "## 4. `validate_mx_curp()`"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "laden-devon",
   "metadata": {},
   "source": [
    "`validate_mx_curp()` returns `True` when the input is a valid CURP. Otherwise it returns `False`.\n",
    "\n",
    "The input of `validate_mx_curp()` can be a string, a Pandas DataSeries, a Dask DataSeries, a Pandas DataFrame and a dask DataFrame.\n",
    "\n",
    "When the input is a string, a Pandas DataSeries or a Dask DataSeries, user doesn't need to specify a column name to be validated. \n",
    "\n",
    "When the input is a Pandas DataFrame or a dask DataFrame, user can both specify or not specify a column name to be validated. If user specify the column name, `validate_mx_curp()` only returns the validation result for the specified column. If user doesn't specify the column name, `validate_mx_curp()` returns the validation result for the whole DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "equivalent-cambodia",
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import validate_mx_curp\n",
    "print(validate_mx_curp('BOXW310820HNERXN09'))\n",
    "print(validate_mx_curp('BOXW310820HNERXN08'))\n",
    "print(validate_mx_curp('7542011030'))\n",
    "print(validate_mx_curp('7552A10004'))\n",
    "print(validate_mx_curp('8019010008'))\n",
    "print(validate_mx_curp(\"hello\"))\n",
    "print(validate_mx_curp(np.nan))\n",
    "print(validate_mx_curp(\"NULL\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "latin-toronto",
   "metadata": {},
   "source": [
    "### Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "detected-reserve",
   "metadata": {},
   "outputs": [],
   "source": [
    "validate_mx_curp(df[\"curp\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "alleged-answer",
   "metadata": {},
   "source": [
    "### DataFrame + Specify Column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "signed-wayne",
   "metadata": {},
   "outputs": [],
   "source": [
    "validate_mx_curp(df, column=\"curp\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "worst-hormone",
   "metadata": {},
   "source": [
    "### Only DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "assumed-works",
   "metadata": {},
   "outputs": [],
   "source": [
    "validate_mx_curp(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dirty-collector",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
